{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "markdown"
    }
   },
   "source": [
    "# Document Augmentation RAG with Question Generation\n",
    "\n",
    "This notebook implements an enhanced RAG approach using document augmentation through question generation. By generating relevant questions for each text chunk, we improve the retrieval process, leading to better responses from the language model.\n",
    "\n",
    "In this implementation, we follow these steps:\n",
    "\n",
    "1. **Data Ingestion**: Extract text from a PDF file.\n",
    "2. **Chunking**: Split the text into manageable chunks.\n",
    "3. **Question Generation**: Generate relevant questions for each chunk.\n",
    "4. **Embedding Creation**: Create embeddings for both chunks and generated questions.\n",
    "5. **Vector Store Creation**: Build a simple vector store using NumPy.\n",
    "6. **Semantic Search**: Retrieve relevant chunks and questions for user queries.\n",
    "7. **Response Generation**: Generate answers based on retrieved content.\n",
    "8. **Evaluation**: Assess the quality of the generated responses."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the Environment\n",
    "We begin by importing necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fitz\n",
    "import os\n",
    "import numpy as np\n",
    "import json\n",
    "from openai import OpenAI\n",
    "import re\n",
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting Text from a PDF File\n",
    "To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_text_from_pdf(pdf_path):\n",
    "    \"\"\"\n",
    "    Extracts text from a PDF file and prints the first `num_chars` characters.\n",
    "\n",
    "    Args:\n",
    "    pdf_path (str): Path to the PDF file.\n",
    "\n",
    "    Returns:\n",
    "    str: Extracted text from the PDF.\n",
    "    \"\"\"\n",
    "    # Open the PDF file\n",
    "    mypdf = fitz.open(pdf_path)\n",
    "    all_text = \"\"  # Initialize an empty string to store the extracted text\n",
    "\n",
    "    # Iterate through each page in the PDF\n",
    "    for page_num in range(mypdf.page_count):\n",
    "        page = mypdf[page_num]  # Get the page\n",
    "        text = page.get_text(\"text\")  # Extract text from the page\n",
    "        all_text += text  # Append the extracted text to the all_text string\n",
    "\n",
    "    return all_text  # Return the extracted text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chunking the Extracted Text\n",
    "Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chunk_text(text, n, overlap):\n",
    "    \"\"\"\n",
    "    Chunks the given text into segments of n characters with overlap.\n",
    "\n",
    "    Args:\n",
    "    text (str): The text to be chunked.\n",
    "    n (int): The number of characters in each chunk.\n",
    "    overlap (int): The number of overlapping characters between chunks.\n",
    "\n",
    "    Returns:\n",
    "    List[str]: A list of text chunks.\n",
    "    \"\"\"\n",
    "    chunks = []  # Initialize an empty list to store the chunks\n",
    "    \n",
    "    # Loop through the text with a step size of (n - overlap)\n",
    "    for i in range(0, len(text), n - overlap):\n",
    "        # Append a chunk of text from index i to i + n to the chunks list\n",
    "        chunks.append(text[i:i + n])\n",
    "\n",
    "    return chunks  # Return the list of text chunks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the OpenAI API Client\n",
    "We initialize the OpenAI client to generate embeddings and responses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the OpenAI client with the base URL and API key\n",
    "client = OpenAI(\n",
    "    base_url=\"https://api.studio.nebius.com/v1/\",\n",
    "    api_key=os.getenv(\"OPENAI_API_KEY\")  # Retrieve the API key from environment variables\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating Questions for Text Chunks\n",
    "This is the key enhancement over simple RAG. We generate questions that could be answered by each text chunk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_questions(text_chunk, num_questions=5, model=\"meta-llama/Llama-3.2-3B-Instruct\"):\n",
    "    \"\"\"\n",
    "    Generates relevant questions that can be answered from the given text chunk.\n",
    "\n",
    "    Args:\n",
    "    text_chunk (str): The text chunk to generate questions from.\n",
    "    num_questions (int): Number of questions to generate.\n",
    "    model (str): The model to use for question generation.\n",
    "\n",
    "    Returns:\n",
    "    List[str]: List of generated questions.\n",
    "    \"\"\"\n",
    "    # Define the system prompt to guide the AI's behavior\n",
    "    system_prompt = \"You are an expert at generating relevant questions from text. Create concise questions that can be answered using only the provided text. Focus on key information and concepts.\"\n",
    "    \n",
    "    # Define the user prompt with the text chunk and the number of questions to generate\n",
    "    user_prompt = f\"\"\"\n",
    "    Based on the following text, generate {num_questions} different questions that can be answered using only this text:\n",
    "\n",
    "    {text_chunk}\n",
    "    \n",
    "    Format your response as a numbered list of questions only, with no additional text.\n",
    "    \"\"\"\n",
    "    \n",
    "    # Generate questions using the OpenAI API\n",
    "    response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        temperature=0.7,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_prompt}\n",
    "        ]\n",
    "    )\n",
    "    \n",
    "    # Extract and clean questions from the response\n",
    "    questions_text = response.choices[0].message.content.strip()\n",
    "    questions = []\n",
    "    \n",
    "    # Extract questions using regex pattern matching\n",
    "    for line in questions_text.split('\\n'):\n",
    "        # Remove numbering and clean up whitespace\n",
    "        cleaned_line = re.sub(r'^\\d+\\.\\s*', '', line.strip())\n",
    "        if cleaned_line and cleaned_line.endswith('?'):\n",
    "            questions.append(cleaned_line)\n",
    "    \n",
    "    return questions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Embeddings for Text\n",
    "We generate embeddings for both text chunks and generated questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_embeddings(text, model=\"BAAI/bge-en-icl\"):\n",
    "    \"\"\"\n",
    "    Creates embeddings for the given text using the specified OpenAI model.\n",
    "\n",
    "    Args:\n",
    "    text (str): The input text for which embeddings are to be created.\n",
    "    model (str): The model to be used for creating embeddings.\n",
    "\n",
    "    Returns:\n",
    "    dict: The response from the OpenAI API containing the embeddings.\n",
    "    \"\"\"\n",
    "    # Create embeddings for the input text using the specified model\n",
    "    response = client.embeddings.create(\n",
    "        model=model,\n",
    "        input=text\n",
    "    )\n",
    "\n",
    "    return response  # Return the response containing the embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building a Simple Vector Store\n",
    "We'll implement a simple vector store using NumPy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SimpleVectorStore:\n",
    "    \"\"\"\n",
    "    A simple vector store implementation using NumPy.\n",
    "    \"\"\"\n",
    "    def __init__(self):\n",
    "        \"\"\"\n",
    "        Initialize the vector store.\n",
    "        \"\"\"\n",
    "        self.vectors = []\n",
    "        self.texts = []\n",
    "        self.metadata = []\n",
    "    \n",
    "    def add_item(self, text, embedding, metadata=None):\n",
    "        \"\"\"\n",
    "        Add an item to the vector store.\n",
    "\n",
    "        Args:\n",
    "        text (str): The original text.\n",
    "        embedding (List[float]): The embedding vector.\n",
    "        metadata (dict, optional): Additional metadata.\n",
    "        \"\"\"\n",
    "        self.vectors.append(np.array(embedding))\n",
    "        self.texts.append(text)\n",
    "        self.metadata.append(metadata or {})\n",
    "    \n",
    "    def similarity_search(self, query_embedding, k=5):\n",
    "        \"\"\"\n",
    "        Find the most similar items to a query embedding.\n",
    "\n",
    "        Args:\n",
    "        query_embedding (List[float]): Query embedding vector.\n",
    "        k (int): Number of results to return.\n",
    "\n",
    "        Returns:\n",
    "        List[Dict]: Top k most similar items with their texts and metadata.\n",
    "        \"\"\"\n",
    "        if not self.vectors:\n",
    "            return []\n",
    "        \n",
    "        # Convert query embedding to numpy array\n",
    "        query_vector = np.array(query_embedding)\n",
    "        \n",
    "        # Calculate similarities using cosine similarity\n",
    "        similarities = []\n",
    "        for i, vector in enumerate(self.vectors):\n",
    "            similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))\n",
    "            similarities.append((i, similarity))\n",
    "        \n",
    "        # Sort by similarity (descending)\n",
    "        similarities.sort(key=lambda x: x[1], reverse=True)\n",
    "        \n",
    "        # Return top k results\n",
    "        results = []\n",
    "        for i in range(min(k, len(similarities))):\n",
    "            idx, score = similarities[i]\n",
    "            results.append({\n",
    "                \"text\": self.texts[idx],\n",
    "                \"metadata\": self.metadata[idx],\n",
    "                \"similarity\": score\n",
    "            })\n",
    "        \n",
    "        return results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Processing Documents with Question Augmentation\n",
    "Now we'll put everything together to process documents, generate questions, and build our augmented vector store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_document(pdf_path, chunk_size=1000, chunk_overlap=200, questions_per_chunk=5):\n",
    "    \"\"\"\n",
    "    Process a document with question augmentation.\n",
    "\n",
    "    Args:\n",
    "    pdf_path (str): Path to the PDF file.\n",
    "    chunk_size (int): Size of each text chunk in characters.\n",
    "    chunk_overlap (int): Overlap between chunks in characters.\n",
    "    questions_per_chunk (int): Number of questions to generate per chunk.\n",
    "\n",
    "    Returns:\n",
    "    Tuple[List[str], SimpleVectorStore]: Text chunks and vector store.\n",
    "    \"\"\"\n",
    "    print(\"Extracting text from PDF...\")\n",
    "    extracted_text = extract_text_from_pdf(pdf_path)\n",
    "    \n",
    "    print(\"Chunking text...\")\n",
    "    text_chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)\n",
    "    print(f\"Created {len(text_chunks)} text chunks\")\n",
    "    \n",
    "    vector_store = SimpleVectorStore()\n",
    "    \n",
    "    print(\"Processing chunks and generating questions...\")\n",
    "    for i, chunk in enumerate(tqdm(text_chunks, desc=\"Processing Chunks\")):\n",
    "        # Create embedding for the chunk itself\n",
    "        chunk_embedding_response = create_embeddings(chunk)\n",
    "        chunk_embedding = chunk_embedding_response.data[0].embedding\n",
    "        \n",
    "        # Add the chunk to the vector store\n",
    "        vector_store.add_item(\n",
    "            text=chunk,\n",
    "            embedding=chunk_embedding,\n",
    "            metadata={\"type\": \"chunk\", \"index\": i}\n",
    "        )\n",
    "        \n",
    "        # Generate questions for this chunk\n",
    "        questions = generate_questions(chunk, num_questions=questions_per_chunk)\n",
    "        \n",
    "        # Create embeddings for each question and add to vector store\n",
    "        for j, question in enumerate(questions):\n",
    "            question_embedding_response = create_embeddings(question)\n",
    "            question_embedding = question_embedding_response.data[0].embedding\n",
    "            \n",
    "            # Add the question to the vector store\n",
    "            vector_store.add_item(\n",
    "                text=question,\n",
    "                embedding=question_embedding,\n",
    "                metadata={\"type\": \"question\", \"chunk_index\": i, \"original_chunk\": chunk}\n",
    "            )\n",
    "    \n",
    "    return text_chunks, vector_store"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting and Processing the Document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracting text from PDF...\n",
      "Chunking text...\n",
      "Created 42 text chunks\n",
      "Processing chunks and generating questions...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processing Chunks: 100%|██████████| 42/42 [01:30<00:00,  2.15s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Vector store contains 165 items\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "# Define the path to the PDF file\n",
    "pdf_path = \"data/AI_Information.pdf\"\n",
    "\n",
    "# Process the document (extract text, create chunks, generate questions, build vector store)\n",
    "text_chunks, vector_store = process_document(\n",
    "    pdf_path, \n",
    "    chunk_size=1000, \n",
    "    chunk_overlap=200, \n",
    "    questions_per_chunk=3\n",
    ")\n",
    "\n",
    "print(f\"Vector store contains {len(vector_store.texts)} items\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Performing Semantic Search\n",
    "We implement a semantic search function similar to the simple RAG implementation but adapted to our augmented vector store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def semantic_search(query, vector_store, k=5):\n",
    "    \"\"\"\n",
    "    Performs semantic search using the query and vector store.\n",
    "\n",
    "    Args:\n",
    "    query (str): The search query.\n",
    "    vector_store (SimpleVectorStore): The vector store to search in.\n",
    "    k (int): Number of results to return.\n",
    "\n",
    "    Returns:\n",
    "    List[Dict]: Top k most relevant items.\n",
    "    \"\"\"\n",
    "    # Create embedding for the query\n",
    "    query_embedding_response = create_embeddings(query)\n",
    "    query_embedding = query_embedding_response.data[0].embedding\n",
    "    \n",
    "    # Search the vector store\n",
    "    results = vector_store.similarity_search(query_embedding, k=k)\n",
    "    \n",
    "    return results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running a Query on the Augmented Vector Store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query: What is 'Explainable AI' and why is it considered important?\n",
      "\n",
      "Search Results:\n",
      "\n",
      "Relevant Document Chunks:\n",
      "\n",
      "Matched Questions:\n",
      "Question 1 (similarity: 0.8629):\n",
      "What is the main goal of Explainable AI (XAI)?\n",
      "From chunk 10\n",
      "=====================================\n",
      "Question 2 (similarity: 0.8488):\n",
      "What is the primary goal of Explainable AI (XAI) techniques?\n",
      "From chunk 37\n",
      "=====================================\n",
      "Question 3 (similarity: 0.8414):\n",
      "What is the focus of research on Explainable AI (XAI)?\n",
      "From chunk 29\n",
      "=====================================\n",
      "Question 4 (similarity: 0.7995):\n",
      "Why are transparency and explainability essential for building trust in AI systems?\n",
      "From chunk 36\n",
      "=====================================\n",
      "Question 5 (similarity: 0.7841):\n",
      "Why is transparency and explainability essential in building trust and accountability with AI systems?\n",
      "From chunk 9\n",
      "=====================================\n"
     ]
    }
   ],
   "source": [
    "# Load the validation data from a JSON file\n",
    "with open('data/val.json') as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "# Extract the first query from the validation data\n",
    "query = data[0]['question']\n",
    "\n",
    "# Perform semantic search to find relevant content\n",
    "search_results = semantic_search(query, vector_store, k=5)\n",
    "\n",
    "print(\"Query:\", query)\n",
    "print(\"\\nSearch Results:\")\n",
    "\n",
    "# Organize results by type\n",
    "chunk_results = []\n",
    "question_results = []\n",
    "\n",
    "for result in search_results:\n",
    "    if result[\"metadata\"][\"type\"] == \"chunk\":\n",
    "        chunk_results.append(result)\n",
    "    else:\n",
    "        question_results.append(result)\n",
    "\n",
    "# Print chunk results first\n",
    "print(\"\\nRelevant Document Chunks:\")\n",
    "for i, result in enumerate(chunk_results):\n",
    "    print(f\"Context {i + 1} (similarity: {result['similarity']:.4f}):\")\n",
    "    print(result[\"text\"][:300] + \"...\")\n",
    "    print(\"=====================================\")\n",
    "\n",
    "# Then print question matches\n",
    "print(\"\\nMatched Questions:\")\n",
    "for i, result in enumerate(question_results):\n",
    "    print(f\"Question {i + 1} (similarity: {result['similarity']:.4f}):\")\n",
    "    print(result[\"text\"])\n",
    "    chunk_idx = result[\"metadata\"][\"chunk_index\"]\n",
    "    print(f\"From chunk {chunk_idx}\")\n",
    "    print(\"=====================================\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating Context for Response\n",
    "Now we prepare the context by combining information from relevant chunks and questions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def prepare_context(search_results):\n",
    "    \"\"\"\n",
    "    Prepares a unified context from search results for response generation.\n",
    "\n",
    "    Args:\n",
    "    search_results (List[Dict]): Results from semantic search.\n",
    "\n",
    "    Returns:\n",
    "    str: Combined context string.\n",
    "    \"\"\"\n",
    "    # Extract unique chunks referenced in the results\n",
    "    chunk_indices = set()\n",
    "    context_chunks = []\n",
    "    \n",
    "    # First add direct chunk matches\n",
    "    for result in search_results:\n",
    "        if result[\"metadata\"][\"type\"] == \"chunk\":\n",
    "            chunk_indices.add(result[\"metadata\"][\"index\"])\n",
    "            context_chunks.append(f\"Chunk {result['metadata']['index']}:\\n{result['text']}\")\n",
    "    \n",
    "    # Then add chunks referenced by questions\n",
    "    for result in search_results:\n",
    "        if result[\"metadata\"][\"type\"] == \"question\":\n",
    "            chunk_idx = result[\"metadata\"][\"chunk_index\"]\n",
    "            if chunk_idx not in chunk_indices:\n",
    "                chunk_indices.add(chunk_idx)\n",
    "                context_chunks.append(f\"Chunk {chunk_idx} (referenced by question '{result['text']}'):\\n{result['metadata']['original_chunk']}\")\n",
    "    \n",
    "    # Combine all context chunks\n",
    "    full_context = \"\\n\\n\".join(context_chunks)\n",
    "    return full_context"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating a Response Based on Retrieved Chunks\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_response(query, context, model=\"meta-llama/Llama-3.2-3B-Instruct\"):\n",
    "    \"\"\"\n",
    "    Generates a response based on the query and context.\n",
    "\n",
    "    Args:\n",
    "    query (str): User's question.\n",
    "    context (str): Context information retrieved from the vector store.\n",
    "    model (str): Model to use for response generation.\n",
    "\n",
    "    Returns:\n",
    "    str: Generated response.\n",
    "    \"\"\"\n",
    "    system_prompt = \"You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'\"\n",
    "    \n",
    "    user_prompt = f\"\"\"\n",
    "        Context:\n",
    "        {context}\n",
    "\n",
    "        Question: {query}\n",
    "\n",
    "        Please answer the question based only on the context provided above. Be concise and accurate.\n",
    "    \"\"\"\n",
    "    \n",
    "    response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        temperature=0,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_prompt}\n",
    "        ]\n",
    "    )\n",
    "    \n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating and Displaying the Response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Query: What is 'Explainable AI' and why is it considered important?\n",
      "\n",
      "Response:\n",
      "Explainable AI (XAI) is a field that aims to make AI systems more transparent and understandable by providing insights into how AI models make decisions. This is essential for building trust and accountability in AI systems, as it enables users to assess their fairness and accuracy. XAI techniques are crucial for addressing potential harms, ensuring ethical behavior, and establishing clear guidelines and ethical frameworks for AI development and deployment.\n"
     ]
    }
   ],
   "source": [
    "# Prepare context from search results\n",
    "context = prepare_context(search_results)\n",
    "\n",
    "# Generate response\n",
    "response_text = generate_response(query, context)\n",
    "\n",
    "print(\"\\nQuery:\", query)\n",
    "print(\"\\nResponse:\")\n",
    "print(response_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the AI Response\n",
    "We compare the AI response with the expected answer and assign a score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_response(query, response, reference_answer, model=\"meta-llama/Llama-3.2-3B-Instruct\"):\n",
    "    \"\"\"\n",
    "    Evaluates the AI response against a reference answer.\n",
    "    \n",
    "    Args:\n",
    "    query (str): The user's question.\n",
    "    response (str): The AI-generated response.\n",
    "    reference_answer (str): The reference/ideal answer.\n",
    "    model (str): Model to use for evaluation.\n",
    "    \n",
    "    Returns:\n",
    "    str: Evaluation feedback.\n",
    "    \"\"\"\n",
    "    # Define the system prompt for the evaluation system\n",
    "    evaluate_system_prompt = \"\"\"You are an intelligent evaluation system tasked with assessing AI responses.\n",
    "            \n",
    "        Compare the AI assistant's response to the true/reference answer, and evaluate based on:\n",
    "        1. Factual correctness - Does the response contain accurate information?\n",
    "        2. Completeness - Does it cover all important aspects from the reference?\n",
    "        3. Relevance - Does it directly address the question?\n",
    "\n",
    "        Assign a score from 0 to 1:\n",
    "        - 1.0: Perfect match in content and meaning\n",
    "        - 0.8: Very good, with minor omissions/differences\n",
    "        - 0.6: Good, covers main points but misses some details\n",
    "        - 0.4: Partial answer with significant omissions\n",
    "        - 0.2: Minimal relevant information\n",
    "        - 0.0: Incorrect or irrelevant\n",
    "\n",
    "        Provide your score with justification.\n",
    "    \"\"\"\n",
    "            \n",
    "    # Create the evaluation prompt\n",
    "    evaluation_prompt = f\"\"\"\n",
    "        User Query: {query}\n",
    "\n",
    "        AI Response:\n",
    "        {response}\n",
    "\n",
    "        Reference Answer:\n",
    "        {reference_answer}\n",
    "\n",
    "        Please evaluate the AI response against the reference answer.\n",
    "    \"\"\"\n",
    "    \n",
    "    # Generate evaluation\n",
    "    eval_response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        temperature=0,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": evaluate_system_prompt},\n",
    "            {\"role\": \"user\", \"content\": evaluation_prompt}\n",
    "        ]\n",
    "    )\n",
    "    \n",
    "    return eval_response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running the Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Evaluation:\n",
      "Based on the evaluation criteria, I will assess the AI response as follows:\n",
      "\n",
      "1. Factual correctness: The AI response contains accurate information about Explainable AI (XAI) and its importance. It correctly states that XAI aims to make AI systems more transparent and understandable, providing insights into how they make decisions.\n",
      "\n",
      "2. Completeness: The AI response covers the main points of XAI, including its importance for building trust, accountability, and ensuring fairness in AI systems. However, it misses some details, such as the potential harms that XAI can address and the need for clear guidelines and ethical frameworks for AI development and deployment.\n",
      "\n",
      "3. Relevance: The AI response directly addresses the question, providing a clear and concise explanation of XAI and its significance.\n",
      "\n",
      "Based on the evaluation, I would assign a score of 0.8 to the AI response. The response is very good, with minor omissions and differences from the reference answer. It covers the main points of XAI and its importance, but misses some details that are present in the reference answer.\n"
     ]
    }
   ],
   "source": [
    "# Get reference answer from validation data\n",
    "reference_answer = data[0]['ideal_answer']\n",
    "\n",
    "# Evaluate the response\n",
    "evaluation = evaluate_response(query, response_text, reference_answer)\n",
    "\n",
    "print(\"\\nEvaluation:\")\n",
    "print(evaluation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting and Chunking Text from a PDF File\n",
    "Now, we load the PDF, extract text, and split it into chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of text chunks: 42\n",
      "\n",
      "First text chunk:\n",
      "Understanding Artificial Intelligence \n",
      "Chapter 1: Introduction to Artificial Intelligence \n",
      "Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot \n",
      "to perform tasks commonly associated with intelligent beings. The term is frequently applied to \n",
      "the project of developing systems endowed with the intellectual processes characteristic of \n",
      "humans, such as the ability to reason, discover meaning, generalize, or learn from past \n",
      "experience. Over the past few decades, advancements in computing power and data availability \n",
      "have significantly accelerated the development and deployment of AI. \n",
      "Historical Context \n",
      "The idea of artificial intelligence has existed for centuries, often depicted in myths and fiction. \n",
      "However, the formal field of AI research began in the mid-20th century. The Dartmouth Workshop \n",
      "in 1956 is widely considered the birthplace of AI. Early AI research focused on problem-solving \n",
      "and symbolic methods. The 1980s saw a rise in exp\n"
     ]
    }
   ],
   "source": [
    "# Define the path to the PDF file\n",
    "pdf_path = \"data/AI_Information.pdf\"\n",
    "\n",
    "# Extract text from the PDF file\n",
    "extracted_text = extract_text_from_pdf(pdf_path)\n",
    "\n",
    "# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters\n",
    "text_chunks = chunk_text(extracted_text, 1000, 200)\n",
    "\n",
    "# Print the number of text chunks created\n",
    "print(\"Number of text chunks:\", len(text_chunks))\n",
    "\n",
    "# Print the first text chunk\n",
    "print(\"\\nFirst text chunk:\")\n",
    "print(text_chunks[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Embeddings for Text Chunks\n",
    "Embeddings transform text into numerical vectors, which allow for efficient similarity search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_embeddings(text, model=\"BAAI/bge-en-icl\"):\n",
    "    \"\"\"\n",
    "    Creates embeddings for the given text using the specified OpenAI model.\n",
    "\n",
    "    Args:\n",
    "    text (str): The input text for which embeddings are to be created.\n",
    "    model (str): The model to be used for creating embeddings. Default is \"BAAI/bge-en-icl\".\n",
    "\n",
    "    Returns:\n",
    "    dict: The response from the OpenAI API containing the embeddings.\n",
    "    \"\"\"\n",
    "    # Create embeddings for the input text using the specified model\n",
    "    response = client.embeddings.create(\n",
    "        model=model,\n",
    "        input=text\n",
    "    )\n",
    "\n",
    "    return response  # Return the response containing the embeddings\n",
    "\n",
    "# Create embeddings for the text chunks\n",
    "response = create_embeddings(text_chunks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Performing Semantic Search\n",
    "We implement cosine similarity to find the most relevant text chunks for a user query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cosine_similarity(vec1, vec2):\n",
    "    \"\"\"\n",
    "    Calculates the cosine similarity between two vectors.\n",
    "\n",
    "    Args:\n",
    "    vec1 (np.ndarray): The first vector.\n",
    "    vec2 (np.ndarray): The second vector.\n",
    "\n",
    "    Returns:\n",
    "    float: The cosine similarity between the two vectors.\n",
    "    \"\"\"\n",
    "    # Compute the dot product of the two vectors and divide by the product of their norms\n",
    "    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def semantic_search(query, text_chunks, embeddings, k=5):\n",
    "    \"\"\"\n",
    "    Performs semantic search on the text chunks using the given query and embeddings.\n",
    "\n",
    "    Args:\n",
    "    query (str): The query for the semantic search.\n",
    "    text_chunks (List[str]): A list of text chunks to search through.\n",
    "    embeddings (List[dict]): A list of embeddings for the text chunks.\n",
    "    k (int): The number of top relevant text chunks to return. Default is 5.\n",
    "\n",
    "    Returns:\n",
    "    List[str]: A list of the top k most relevant text chunks based on the query.\n",
    "    \"\"\"\n",
    "    # Create an embedding for the query\n",
    "    query_embedding = create_embeddings(query).data[0].embedding\n",
    "    similarity_scores = []  # Initialize a list to store similarity scores\n",
    "\n",
    "    # Calculate similarity scores between the query embedding and each text chunk embedding\n",
    "    for i, chunk_embedding in enumerate(embeddings):\n",
    "        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))\n",
    "        similarity_scores.append((i, similarity_score))  # Append the index and similarity score\n",
    "\n",
    "    # Sort the similarity scores in descending order\n",
    "    similarity_scores.sort(key=lambda x: x[1], reverse=True)\n",
    "    # Get the indices of the top k most similar text chunks\n",
    "    top_indices = [index for index, _ in similarity_scores[:k]]\n",
    "    # Return the top k most relevant text chunks\n",
    "    return [text_chunks[index] for index in top_indices]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running a Query on Extracted Chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query: What is 'Explainable AI' and why is it considered important?\n",
      "Context 1:\n",
      "systems. Explainable AI (XAI) \n",
      "techniques aim to make AI decisions more understandable, enabling users to assess their \n",
      "fairness and accuracy. \n",
      "Privacy and Data Protection \n",
      "AI systems often rely on large amounts of data, raising concerns about privacy and data \n",
      "protection. Ensuring responsible data handling, implementing privacy-preserving techniques, \n",
      "and complying with data protection regulations are crucial. \n",
      "Accountability and Responsibility \n",
      "Establishing accountability and responsibility for AI systems is essential for addressing potential \n",
      "harms and ensuring ethical behavior. This includes defining roles and responsibilities for \n",
      "developers, deployers, and users of AI systems. \n",
      "Chapter 20: Building Trust in AI \n",
      "Transparency and Explainability \n",
      "Transparency and explainability are key to building trust in AI. Making AI systems understandable \n",
      "and providing insights into their decision-making processes helps users assess their reliability \n",
      "and fairness. \n",
      "Robustness and Reliability \n",
      "\n",
      "=====================================\n",
      "Context 2:\n",
      " incidents. \n",
      "Environmental Monitoring \n",
      "AI-powered environmental monitoring systems track air and water quality, detect pollution, and \n",
      "support environmental protection efforts. These systems provide real-time data, identify \n",
      "pollution sources, and inform environmental policies. \n",
      "Chapter 15: The Future of AI Research \n",
      "Advancements in Deep Learning \n",
      "Continued advancements in deep learning are expected to drive further breakthroughs in AI. \n",
      "Research is focused on developing more efficient and interpretable deep learning models, as well \n",
      "as exploring new architectures and training techniques. \n",
      "Explainable AI (XAI) \n",
      "Explainable AI (XAI) aims to make AI systems more transparent and understandable. Research in \n",
      "XAI focuses on developing methods for explaining AI decisions, enhancing trust, and improving \n",
      "accountability. \n",
      "AI and Neuroscience \n",
      "The intersection of AI and neuroscience is a promising area of research. Understanding the \n",
      "human brain can inspire new AI algorithms and architectures, \n",
      "=====================================\n"
     ]
    }
   ],
   "source": [
    "# Load the validation data from a JSON file\n",
    "with open('data/val.json') as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "# Extract the first query from the validation data\n",
    "query = data[0]['question']\n",
    "\n",
    "# Perform semantic search to find the top 2 most relevant text chunks for the query\n",
    "top_chunks = semantic_search(query, text_chunks, response.data, k=2)\n",
    "\n",
    "# Print the query\n",
    "print(\"Query:\", query)\n",
    "\n",
    "# Print the top 2 most relevant text chunks\n",
    "for i, chunk in enumerate(top_chunks):\n",
    "    print(f\"Context {i + 1}:\\n{chunk}\\n=====================================\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating a Response Based on Retrieved Chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the system prompt for the AI assistant\n",
    "system_prompt = \"You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'\"\n",
    "\n",
    "def generate_response(system_prompt, user_message, model=\"meta-llama/Llama-3.2-3B-Instruct\"):\n",
    "    \"\"\"\n",
    "    Generates a response from the AI model based on the system prompt and user message.\n",
    "\n",
    "    Args:\n",
    "    system_prompt (str): The system prompt to guide the AI's behavior.\n",
    "    user_message (str): The user's message or query.\n",
    "    model (str): The model to be used for generating the response. Default is \"meta-llama/Llama-2-7B-chat-hf\".\n",
    "\n",
    "    Returns:\n",
    "    dict: The response from the AI model.\n",
    "    \"\"\"\n",
    "    response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        temperature=0,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_message}\n",
    "        ]\n",
    "    )\n",
    "    return response\n",
    "\n",
    "# Create the user prompt based on the top chunks\n",
    "user_prompt = \"\\n\".join([f\"Context {i + 1}:\\n{chunk}\\n=====================================\\n\" for i, chunk in enumerate(top_chunks)])\n",
    "user_prompt = f\"{user_prompt}\\nQuestion: {query}\"\n",
    "\n",
    "# Generate AI response\n",
    "ai_response = generate_response(system_prompt, user_prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the AI Response\n",
    "We compare the AI response with the expected answer and assign a score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Based on the evaluation criteria, I would assign a score of 0.8 to the AI assistant's response.\n",
      "\n",
      "The response is very close to the true response, and it correctly conveys the main idea of Explainable AI (XAI) and its importance. The AI assistant has successfully identified the primary goal of XAI, its significance in building trust and accountability, and its relevance to areas such as privacy and data protection.\n",
      "\n",
      "However, the response could be improved by providing more specific details and examples to support the claims made. For instance, the AI assistant could have elaborated on the techniques used in XAI, such as model interpretability, feature attribution, and explainability metrics. Additionally, the response could have provided more concrete examples of how XAI is being applied in various fields, such as healthcare and finance.\n",
      "\n",
      "Overall, the response is a good start, but it could benefit from more depth and specificity to make it more accurate and informative.\n"
     ]
    }
   ],
   "source": [
    "# Define the system prompt for the evaluation system\n",
    "evaluate_system_prompt = \"You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5.\"\n",
    "\n",
    "# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt\n",
    "evaluation_prompt = f\"User Query: {query}\\nAI Response:\\n{ai_response.choices[0].message.content}\\nTrue Response: {data[0]['ideal_answer']}\\n{evaluate_system_prompt}\"\n",
    "\n",
    "# Generate the evaluation response using the evaluation system prompt and evaluation prompt\n",
    "evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)\n",
    "\n",
    "# Print the evaluation response\n",
    "print(evaluation_response.choices[0].message.content)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv-new-specific-rag",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
